A machine learning approach to create blocking criteria for record linkage.
نویسنده
چکیده
Record linkage, a part of data cleaning, is recognized as one of most expensive steps in data warehousing. Most record linkage (RL) systems employ a strategy of using blocking filters to reduce the number of pairs to be matched. A blocking filter consists of a number of blocking criteria. Until recently, blocking criteria are selected manually by domain experts. This paper proposes a new method to automatically learn efficient blocking criteria for record linkage. Our method addresses the lack of sufficient labeled data for training. Unlike previous works, we do not consider a blocking filter in isolation but in the context of an accompanying matcher which is employed after the blocking filter. We show that given such a matcher, the labels (assigned to record pairs) that are relevant for learning are the labels assigned by the matcher (link/nonlink), not the labels assigned objectively (match/unmatch). This conclusion allows us to generate an unlimited amount of labeled data for training. We formulate the problem of learning a blocking filter as a Disjunctive Normal Form (DNF) learning problem and use the Probably Approximately Correct (PAC) learning theory to guide the development of algorithm to search for blocking filters. We test the algorithm on a real patient master file of 2.18 million records. The experimental results show that compared with filters obtained by educated guess, the optimal learned filters have comparable recall but reduce throughput (runtime) by an order-of-magnitude factor.
منابع مشابه
Learning Blocking Schemes for Record Linkage
Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, ...
متن کاملAdaptive and Flexible Blocking for Record Linkage Tasks
In data integration tasks, records from a single dataset or from different sources must often be compared to identify records that represent the same real world entity. The cost of this search process for finding duplicate records grows quadratically as the number of records available in the data sources increases and, for this reason, direct approaches, such as comparing all record pairs, must...
متن کاملLeveraging Unlabeled Data to Scale Blocking for Record Linkage
Record linkage is the process of matching records between two (or multiple) data sets that represent the same real-world entity. An exhaustive record linkage process involves computing the similarities between all pairs of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and only comparing records within the...
متن کاملDynamic Record Blocking: Efficient Linking of Massive Databases in MapReduce
Record Linkage is the task of identifying which records in a database refer to the same entity. A standard machine learning approach to this problem is to train a model that assigns scores to pairs of records where pairs scoring above a threshold are said to represent the same entity. However, it is too expensive to make pairwise comparisons among all records in large databases. “Blocking” is t...
متن کاملProbabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When the comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets would lead to information loss available in other data sets. Hence, it is necessary to integrate scattered information to a comprehensive unique data set. On the other hand, sometimes we are interested in recognition of duplications in a data set. The i...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Health care management science
دوره 18 1 شماره
صفحات -
تاریخ انتشار 2015